Cache arrangement for improving RAID I/O operations

ABSTRACT

The embodiments of the invention provide a method, apparatus, etc. for a cache arrangement for improving RAID I/O operations. More specifically, a method begins by partitioning a data object into a plurality of data blocks and creating one or more parity data blocks from the data object. Next, the data blocks and the parity data blocks are stored within storage nodes. Following this, the method caches data blocks within a partitioned cache, wherein the partitioned cache includes a plurality of cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object. Moreover, the caching within the partitioned cache only caches data blocks in parity storage nodes, wherein the parity storage nodes comprise a parity storage field. Thus, caching within the partitioned cache avoids caching data blocks within storage nodes lacking the parity storage field.

BACKGROUND

1. Field of the Invention

The embodiments of the invention provide a method, apparatus, etc. for a cache arrangement for improving RAID I/O operations.

2. Description of the Related Art

It is often necessary in a distributed storage system to read or write data redundantly that has been striped on more than one storage server (or target). Such a system configuration is referred to as a “network-RAID” (redundant array of independent disks) because the function of a RAID controller is performed by the network protocol of the distributed storage system by coordinating I/O (input/output) operations that are processed at multiple places concurrently in order to ensure correct system behavior, both atomically and serially. Distributed storage systems using a network-RAID protocol can process, or coordinate, a network-RAID-protocol I/O request (I/O request) locally at a client node, or the request can be forwarded to a storage server or a coordination server for processing. For example, one client node may locally write data to a particular data location, while another client node may choose to forward a read or a write request for the same data location to a shared, or coordination, server.

SUMMARY

The embodiments of the invention provide a method, apparatus, etc. for a cache arrangement for improving RAID I/O operations. More specifically, a method for cache management within a distributed data storage system begins by partitioning a data object into a plurality of data blocks and creating one or more parity data blocks from the data object. Next, the data blocks and the parity data blocks are stored within storage nodes.

Following this, the method caches data blocks within a partitioned cache, wherein the partitioned cache includes a plurality of cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object. Moreover, the caching within the partitioned cache only caches data blocks in parity storage nodes, wherein the parity storage nodes comprise a parity storage field. Thus, caching within the partitioned cache avoids caching data blocks within storage nodes lacking the parity storage field. When the storage nodes comprise more than one parity storage node, the data blocks are cached in any of the parity storage nodes.

The method further includes updating the data object. Specifically, a write request is annotated with information regarding changed data blocks within the data object; and, the write request is only sent to the parity storage nodes. The sending of the write request only to the parity storage nodes comprises simultaneously performing an invalidation operation and a write operation. Subsequently, the data blocks and parity data block are read from the storage nodes.

An apparatus for cache management within a distributed data storage system is also provided. More specifically, the apparatus comprises a partitioner to partition a data object into a plurality of data blocks. An analysis engine is operatively connected to the partitioner, wherein the analysis engine creates one or more parity data blocks from the data object. Moreover, a controller is operatively connected to the analysis engine, wherein the controller stores the data blocks and the parity data blocks within storage nodes.

The controller also caches data blocks within a partitioned cache, wherein the partitioned cache includes a plurality of cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object. When caching within the partitioned cache, the controller only caches data blocks in parity storage nodes, wherein the parity storage nodes have a parity storage field. Thus, when caching, the controller avoids caching data blocks within storage nodes lacking the parity storage field. When the storage nodes have more than one parity storage node, the controller caches the data blocks in any of the parity storage nodes.

Additionally, the controller annotates a write request with information regarding changed data blocks within the data object and sends the write request to the parity storage nodes. The controller simultaneously performs an invalidation operation and a write operation. The apparatus further includes a reader operatively connected to the controller, wherein the reader reads the data blocks and the parity data blocks from the storage nodes.

Accordingly, the embodiments of the invention provide a technique to build a lightweight cache coherence protocol in distributed storage systems by exploiting the update patterns inherent with erasure coded data. Such a technique can unify the partitioned caches into a single large read cache and can use the cached data to improve RAID I/O operations from clients.

BRIEF DESCRIPTION OF THE DRAWINGS

The embodiments of the invention will be better understood from the following detailed description with reference to the drawings, in which:

FIG. 1 is a table illustrating benefits of caching while executing write and reconstruct read operations;

FIG. 2 is a table illustrating an enumeration of the types of plans generated by the embodiments of the invention;

FIGS. 3A and 3B are diagrams illustrating two variants of I/O update topology for distributed RAID that keep data in sync;

FIGS. 4A, 4B, 4C, and 4D are diagrams illustrating four ways to prime the cache at the parity nodes to improve RAID I/O operations in distributed RAID storage systems;

FIG. 5 is a diagram illustrating a system for a cache arrangement for improving RAID I/O operations;

FIG. 6 is a diagram illustrating a data object stripe;

FIGS. 7A and 7B are diagrams illustrating a cache arrangement for improving RAID I/O operations;

FIG. 8 is a diagram illustrating an apparatus for cache management within a distributed data storage system; and

FIG. 9 is a flow diagram illustrating a method for cache management within a distributed data storage system.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

The embodiments of the invention and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments that are illustrated in the accompanying drawings and detailed in the following description. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale. Descriptions of well-known components and processing techniques are omitted so as to not unnecessarily obscure the embodiments of the invention. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments of the invention may be practiced and to further enable those of skill in the art to practice the embodiments of the invention. Accordingly, the examples should not be construed as limiting the scope of the embodiments of the invention.

The embodiments of the invention provide a technique to build a lightweight cache coherence protocol in distributed storage systems by exploiting the update patterns inherent with erasure coded data. Such a technique can unify the partitioned caches into a single large read cache and can use the cached data to improve RAID I/O operations from clients.

Erasure coded data benefits the most from caching while executing write and reconstruct read operations. FIG. 1 illustrates a table showing these benefits. Specifically, an example of the savings with the embodiments of the invention is shown when the underlying distributed RAID layout is RAID5 over 4 nodes. The savings come from exploiting the cache state at various nodes of a distributed RAID system. Pages for a given stripe could be in the read cache at one or more parity node(s), data nodes, and/or client nodes. Embodiments herein can deliver such savings when the working set exceeds the total cache size of a single client node. Brick systems may have more (aggregate) cache space fronting the drives than comparable RAID controllers. Phrased another way, for the same cost of the system, more aggregate cache can be included in a brick system than in a monolithic system.

To be used effectively, a dispersed cache requires some cache coherence scheme, which comprises two parts. First, a scalable cache directory is needed to map pages to nodes. Second, an invalidation (or coherence) protocol is needed to ensure correctness. With erasure codes, read/write performance of data in degraded/critical mode is significantly slower than under fault-free mode. If at least the working set is cached somewhere until the rebuild operation completes, then read/write performance can be improved. Specifically, given a cache arrangement scheme, the execution of a RAID read or write operation at a node can be optimized by leveraging pages that are in the caches of the different nodes.

Considering data laid out in some erasure code layout (e.g., RAID5), for each data stripe, a subset of the bricks take on different roles. Each brick stores a stripe of data for which it is the target node (TN). For each stripe, there will be at least t pages to store parity for a t-fault tolerant code. Each parity page is stored on a different parity node (PN). Client nodes (CN) are also provided. From the perspective of any dirty data page, the multiple nodes in the system are categorized as described below. CN is the client node that initiates the flush of this dirty page; and, TN is the target node to which the dirty data page should be written. {PN} is the parity node that hosts the parity page that depends on the dirty data page. There can be multiple parities depending on the layout, which is indicated by the curly brackets. {DN} is the dependent node that hosts the dependent data (dD) contributing to the calculation of the same parity as the dirty page.
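
For concreteness, the sketch below records these roles for a single dirty page in a RAID5-like stripe. It is a minimal illustration only; the data structure, brick names, and role assignments are assumptions and are not taken from the disclosure.

```python
# Hedged sketch: the node roles defined above, recorded for one dirty data page.
# The dataclass and the example assignments are illustrative only.
from dataclasses import dataclass
from typing import List

@dataclass
class DirtyPageRoles:
    cn: str          # client node (CN) that initiates the flush of the dirty page
    tn: str          # target node (TN) to which the dirty page is written
    pn: List[str]    # parity node(s) {PN} whose parity depends on the page
    dn: List[str]    # dependent node(s) {DN} holding the other data (dD) of the stripe

# Example for a 4+1 RAID5 stripe where brick B1 updates a page whose target is B3:
roles = DirtyPageRoles(cn="B1", tn="B3", pn=["B5"], dn=["B1", "B2", "B4"])
```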

The XOR calculations for new parity can be performed at any one of these nodes or at a combination of them. Locally, each of the above nodes can have one of two plans: parity compute (PC) or parity increment (PI). Additionally, two issues need to be addressed. The first issue is how each kind of node derives its own best I/O plan. The second issue is how the different nodes interact with each other to reach agreement on the final I/O plan.

FIG. 2 illustrates a table that enumerates all I/O plans possible amongst these nodes for a given dirty page. The overarching notation is that a write changes D_(old) to D_(new), which requires updating the relevant parity page from P_(old) to P_(new). In some schemes, a partial parity is used, Δ = D_(new) xor D_(old). Next, a method is presented to derive the best local I/O plan, along with the communication protocol that allows different nodes to reach agreement on the final I/O plan.
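
As a worked illustration of this notation, the following minimal sketch models pages as equal-length byte strings, computes the partial parity Δ, and applies it to obtain P_(new). The function names are illustrative and not part of the disclosure.

```python
# Minimal sketch of the partial-parity arithmetic: Delta = D_new xor D_old,
# P_new = P_old xor Delta. Pages are modeled as equal-length byte strings.

def xor_pages(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length pages."""
    return bytes(x ^ y for x, y in zip(a, b))

def partial_parity(d_old: bytes, d_new: bytes) -> bytes:
    """Delta shipped from the writer toward the parity node(s)."""
    return xor_pages(d_new, d_old)

def updated_parity(p_old: bytes, delta: bytes) -> bytes:
    """New parity obtained by applying Delta at a parity node."""
    return xor_pages(p_old, delta)

# Example with 4-byte pages:
d_old, d_new, p_old = b"\x01\x02\x03\x04", b"\x01\xff\x03\x04", b"\x10\x20\x30\x40"
delta = partial_parity(d_old, d_new)
assert updated_parity(p_old, delta) == xor_pages(p_old, xor_pages(d_old, d_new))
```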

Data pages can be cached only at the parity nodes that depend on them. When an update to the data page occurs (at the CN), the invalidation can be piggybacked on that operation to the new parity page (to the PN). The PN is guaranteed to receive an update operation due to how redundancy is maintained, i.e., erasure coding. In other words, if data pages are cached at the parity node(s), the new data is always in the parity nodes. This can be checked during a read to that data by any CN. The unchanged data, which is not in the parity nodes, is not invalidated.

Beyond just invalidation, by employing certain client write I/O plans, this cache at the parity node(s) can be kept in sync without any extra messaging. FIGS. 3A and 3B illustrate two such I/O plans (each employing the parity increment with Δ). Specifically, in FIG. 3A, the CN writes new data to the target node, computes Δ, and ships it to the affected parity nodes to be applied. In FIG. 3B, the CN writes new data to the parity node that holds the old data. This parity node computes Δ and ships it to the target and other parity nodes to be applied.
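
A minimal sketch of the FIG. 3A variant follows. The classes, method names, and in-memory "transport" are assumptions made for illustration, and the in-place cache refresh at the parity node stands in for the piggybacked invalidation described above.

```python
# Hedged sketch of the FIG. 3A update plan: the CN writes D_new to the TN,
# computes Delta, and ships Delta to each affected PN. A PN that happens to
# cache the block refreshes its copy in place (D_new = D_cached xor Delta),
# so the cache stays in sync with no extra messages. All names are illustrative.

def xor_pages(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

class TargetNode:
    def __init__(self, data: bytes):
        self.data = data

class ParityNode:
    def __init__(self, parity: bytes):
        self.parity = parity
        self.cache = {}                                   # block id -> cached data page

    def apply_delta(self, block_id: int, delta: bytes) -> None:
        self.parity = xor_pages(self.parity, delta)       # P_new = P_old xor Delta
        if block_id in self.cache:                        # piggybacked coherence
            self.cache[block_id] = xor_pages(self.cache[block_id], delta)

def client_write(block_id: int, d_new: bytes, tn: TargetNode, pns: list) -> None:
    delta = xor_pages(d_new, tn.data)                     # Delta = D_new xor D_old
    tn.data = d_new                                       # write new data to the target node
    for pn in pns:
        pn.apply_delta(block_id, delta)                   # ship Delta to each affected parity node
```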

As illustrated in FIGS. 4A, 4B, 4C, and 4D, four alternatives are provided to describe how the parity node(s) gather data pages from client or target nodes. In FIG. 4A, the target node (in response to a client read) ships the data to one or more parity nodes. In FIG. 4B, the client demotes a clean page it would have discarded to one or more parity nodes. Further, in FIG. 4C, the target demotes the page to the parity node. In FIG. 4D, the parity node asynchronously reads pages from the target node.

If both the TN and one or more PNs cache a data block, the effective cache size is reduced. This leads to greater cache pressure on (global pool) cache pages. To avoid this, three rules for caching data are provided. First, the TN does not read cached data pages except during system transience (writing, buffering). This makes the TN's cache exclusive. Second, when the erasure code allows for multiple PNs, any one of them can be chosen (e.g., randomly). Third, the first rule is not applicable to parity pages, which can be cached during transience.

With this caching scheme in place, embodiments herein can use one round of messages to gather all candidate I/O plan costs from all t PNs, compare them with the local plans available to the CN, and pick the best plan. In degraded/critical mode, reconstructed pages are held at the parity node longer (until the rebuild completes or cache pressure builds sufficiently) for possible reuse by another client. As discussed above, if at least the working set is cached somewhere until the rebuild operation completes, then read/write performance can be improved. Specifically, given a cache arrangement scheme, the execution of a RAID read or write operation at a node can be optimized by leveraging pages that are in the caches of the different nodes.
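
The plan selection can be pictured as the small sketch below. The plan names and cost figures are invented for illustration (e.g., estimated disk I/Os), and the single round of messages is abstracted as a list of replies from the parity nodes.

```python
# Hedged sketch of the one-round plan negotiation: the client merges its own
# candidate plans with the candidates reported by each of the t parity nodes
# and picks the cheapest. Plan names and cost units are illustrative only.

def choose_best_plan(local_plans: dict, parity_replies: list) -> tuple:
    candidates = dict(local_plans)                       # plans the CN can run itself
    for i, reply in enumerate(parity_replies):           # one reply per parity node
        for name, cost in reply.items():
            candidates[f"PN{i}:{name}"] = cost
    best = min(candidates, key=candidates.get)
    return best, candidates[best]

# Example: the client's parity-compute plan needs 4 disk I/Os, but PN0 holds the
# old data and parity in cache and can run a parity increment for 2.
best, cost = choose_best_plan({"CN:parity-compute": 4},
                              [{"parity-increment": 2}, {"parity-increment": 3}])
assert (best, cost) == ("PN0:parity-increment", 2)
```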

Thus, while cache invalidation is piggybacked on write operations, priming the caches at the parity nodes takes some extra work. Moreover, read operations will need two phases, including a first phase to exchange plans. Write operations may require three phases, including a first phase to exchange plans (though here there is an opportunity to piggyback). Further, because the cache is location-shifted, the impact this will have on local I/O optimizations (such as prefetching) is unknown.

The embodiments herein can be applied to distributed (clustered) storage systems. For such systems, the embodiments of the invention have the ability to provide read cache unification and to improve RAID I/O operations.

Furthermore, the embodiments of the invention provide a distributed cache management scheme for a storage system that uses erasure coded distributed RAID and has a partitioned cache (where the total sum can be fairly substantial). This speeds up RAID reads and writes by leveraging cached data, where possible. Moreover, this unifies the cache, which maximizes cache effectiveness. There is no duplication of cached data. The cache management scheme is lightweight; no (additional) messaging for cache coherence or a data directory is needed. The management scheme is also opportunistic; any steps can be skipped under a heavy load without affecting correctness.

FIG. 5 is a diagram illustrating a system for such a cache arrangement scheme. The initiator for read or write operations to the dRAID volume can be at a client node 510A or 510B (direct access) or a storage node 520A or 520B (gateway). Meta-data 530 is available to the initiator via a network 540. The storage nodes 520A/520B could have a write and read cache or a read cache only (cache 522A/522B). A dRAIDed stripe is spread across the storage nodes 520A/520B, wherein the system assumes uniformly spread storage.

FIG. 6 is a diagram illustrating a data object stripe within five storage nodes (SN1, SN2, SN3, SN4, and SN5). The data object stripe includes a first data block (D1), a second data block (D2), a third data block (D3), a fourth data block (D4), and a parity block (P). The role of a storage node for a data block can be a client node (CN), parity node (PN), or target node (TN). Each storage node can play multiple roles for different blocks. Thus, SN3 is the target node for D3; SN5 is the parity node; and, any of the storage nodes can be a client node.

Embodiments of the invention provide the following cache rules. First, each write request from a client is annotated with information about the changed blocks within a stripe. Thus, cache invalidation is piggybacked onto regular operations. Second, data blocks can be cached only at the parity node(s). Multiple candidates exist for higher distance codes, and no separate cache directory is needed. Third, data blocks are not cached at the target node, except by the operating system as staging during read/write operations. The “home” location of data is thereby shifted from a target node to a parity node. Fourth, clients “demote” victim data pages to the parity node(s); in the case of a higher distance code, a parity node is chosen lexicographically (see the sketch following this paragraph). Such a parity node primes the caches in storage nodes opportunistically from clients. Fifth, a client or storage node can locally decide to evict (clean) pages. This provides for loosely coupled caching.
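
Rule four can be pictured as the short sketch below. Choosing the lexicographically first parity node, and the transport callback, are assumptions made for illustration; the disclosure only requires that the choice be made lexicographically so that clients converge without a directory.

```python
# Hedged sketch of cache rule four: a victim clean page is demoted to one
# parity node chosen lexicographically, so every client independently picks
# the same cache location without a directory. Names are illustrative.

def demotion_target(parity_nodes: list) -> str:
    """Pick the parity node that should receive a demoted clean page."""
    return min(parity_nodes)                 # e.g., "SN3" ahead of "SN4"

def demote_victim(block_id: int, page: bytes, parity_nodes: list, send) -> None:
    """Opportunistically demote a clean page about to be evicted by a client.

    `send(node, message)` stands in for the system's transport; under heavy
    load this step can simply be skipped without affecting correctness.
    """
    send(demotion_target(parity_nodes),
         {"op": "demote", "block": block_id, "page": page})

# Example: for a block whose stripe keeps parity on SN4 and SN3, every client
# demotes the page to SN3.
assert demotion_target(["SN4", "SN3"]) == "SN3"
```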

A consequence of the cache rules is that data pages from multiple clients get “percolated” into the caches in the storage nodes, which is advantageous for shared workloads without the clients even cooperating. This is irrelevant for totally random workloads, which are no worse off than before. Moreover, the caches at the storage nodes are aligned in a “RAID-friendly” way: all data used to compute a parity block is localized. Further, due to the nature of erasure code updates, cache coherence is free; the parity node(s) have to be written to for write completion, and the annotation helps identify which blocks have changed.

FIGS. 7A and 7B are diagrams illustrating a cache arrangement for improving RAID I/O operations. FIG. 7A illustrates storage node 1 (SN1), which includes data blocks 1, 6, and 11. Storage node 2 (SN2) includes data blocks 2, 7, and 12; and, storage node 3 (SN3) has data blocks 3 and 8, and parity block 3 (P3). Additionally, storage node 4 (SN4) includes data blocks 4 and 9, and parity block 2 (P2); and, storage node 5 (SN5) has data blocks 5 and 10, and parity block 1 (P1). Thus, as illustrated in FIG. 7B, data blocks are only cached in storage nodes having parity blocks (i.e., SN3, SN4, and SN5).
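
A small sketch that reproduces this layout follows. The left-rotating placement formula is one common RAID5 arrangement that happens to match the figure; it is an assumption for illustration, not a requirement of the embodiments.

```python
# Hedged sketch of the FIG. 7A layout and the FIG. 7B caching rule for five
# storage nodes with four data blocks and one parity block per stripe. The
# rotation formula below is assumed; it merely reproduces the figure.

NODES = ["SN1", "SN2", "SN3", "SN4", "SN5"]
DATA_PER_STRIPE = len(NODES) - 1              # 4 data blocks + 1 parity block

def parity_location(stripe: int) -> str:
    """Node holding the parity block of a stripe: SN5, SN4, SN3, ..."""
    return NODES[(len(NODES) - 1 - stripe) % len(NODES)]

def data_location(block: int) -> str:
    """Node holding data block `block` (1-indexed, as numbered in FIG. 7A)."""
    stripe, offset = divmod(block - 1, DATA_PER_STRIPE)
    p = (len(NODES) - 1 - stripe) % len(NODES)
    return NODES[(p + 1 + offset) % len(NODES)]

def may_cache_data(node: str, stripes: int) -> bool:
    """FIG. 7B rule: only nodes holding some parity block cache data blocks."""
    return any(parity_location(s) == node for s in range(stripes))

# Reproduce the figure: blocks 1-12 across three stripes.
assert [data_location(b) for b in (1, 6, 11)] == ["SN1", "SN1", "SN1"]
assert [parity_location(s) for s in range(3)] == ["SN5", "SN4", "SN3"]
assert may_cache_data("SN3", 3) and not may_cache_data("SN1", 3)
```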

Reads and writes include an extra messaging phase to query the cache state at the parity node(s). The client costs the various possible read/update plans using metrics such as disk I/Os and memory bandwidth. The client then chooses the best plan and drives the I/O.

Read plan choices include finding the cheapest reconstruction plan in three steps: inverting the matrix, masking cached pages, and cost planning. Possible locations for the reconstruction include the client node and the parity node(s).
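
The sketch below illustrates the masking and costing steps for the single-parity (RAID5) case, where matrix inversion degenerates to XORing the surviving blocks of the stripe. The location names and the cost unit (disk reads) are assumptions for illustration.

```python
# Hedged sketch of reconstruction-read costing for RAID5: every surviving block
# of the stripe is needed, cached pages are masked out (they need no disk I/O),
# and the remaining reads are the plan's cost. Names and units are illustrative.

def reconstruction_costs(needed_blocks: set, cached_at: dict) -> dict:
    """Disk reads required to reconstruct at each candidate location."""
    return {loc: len(needed_blocks - cached) for loc, cached in cached_at.items()}

# Example: D3 is lost, so reconstruction needs D1, D2, D4, and P. The client
# caches nothing, while the parity node already holds D1, D2, and its own P page.
needed = {"D1", "D2", "D4", "P"}
costs = reconstruction_costs(needed, {"CN": set(), "PN": {"D1", "D2", "P"}})
best = min(costs, key=costs.get)
assert costs == {"CN": 4, "PN": 1} and best == "PN"
```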

Beyond distributed RAID, the embodiments herein are applicable to a class of problems that requires coordination of a distributed cache resource and updates to a set of data blocks that require updates to some common (dependent) block(s). Such systems could include distributed databases and cluster file systems.

Thus, the embodiments of the invention provide a distributed cache arrangement for a storage system that speeds up RAID operations where the workload is conducive. The working set is larger than any single client cache, but it fits in the collective cache. A shared data set exists between the clients, but the data set is time shifted. Moreover, the cache arrangement adjusts automatically to the workloads from clients. If there is a shared workload, then there is a benefit; otherwise, the cache arrangement exploits the collective cache space.

Referring to FIG. 8, an apparatus 800 for cache management within a distributed data storage system is illustrated. More specifically, a partitioner 810 is provided to partition a data object into a plurality of data blocks. An analysis engine 820 is operatively connected to the partitioner 810, wherein the analysis engine 820 creates one or more parity data blocks from the data object. For example, as illustrated in FIG. 6, the data object stripe includes a first data block (D1), a second data block (D2), a third data block (D3), a fourth data block (D4), and a parity block (P). Furthermore, a controller 830 is operatively connected to the analysis engine 820, wherein the controller 830 stores the data blocks and the parity data block within storage nodes. For example, as illustrated in FIGS. 7A and 7B, the data blocks 1-12 and the parity data blocks P1-P3 are stored within the storage nodes SN1-SN5.

The controller 830 also caches the data blocks within a partitioned cache, wherein the partitioned cache includes cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object (e.g., volume, LUN, file system). More specifically, each cache partition is located within a storage node. When caching within the partitioned cache, the controller 830 only caches the data blocks in parity storage nodes, wherein the parity storage nodes include a parity storage field (a field within a storage node where parity data block(s) can be stored). Thus, the controller 830 avoids caching data blocks within storage nodes lacking the parity storage field. For example, as illustrated in FIGS. 7A and 7B, data blocks 1-12 are only cached within the storage nodes having stored parity data blocks. In this example, parity data blocks P1, P2, and P3 are stored in storage nodes SN5, SN4, and SN3, respectively.

When caching within the partitioned cache, and when the storage nodes comprise more than one parity storage node, the controller 830 caches the data blocks in any of the parity storage nodes. Moreover, the controller 830 annotates a write request with information regarding changed data blocks within the data object and sends the write request to the parity storage nodes. The controller 830 simultaneously performs an invalidation operation and a write operation. Additionally, a reader 840 is operatively connected to the controller 830, wherein the reader 840 reads the data blocks and the parity data block from the storage nodes.

Referring to FIG. 9, a method 900 for cache management within a distributed data storage system is illustrated. More specifically, the method 900 begins in item 910 by partitioning a data object into data blocks. Next, in item 920, one or more parity data blocks are created from the data object. As described above, FIG. 6 illustrates a data object stripe having a first data block (D1), a second data block (D2), a third data block (D3), a fourth data block (D4), and a parity block (P). Following this, in item 930, the data blocks and the parity data block are stored within storage nodes. As described above, the role of a storage node for a data block can be a client node (CN), a parity node (PN), or a target node (TN). Each storage node can play multiple roles for different blocks.

In item 940, the data blocks are also cached within a partitioned cache, wherein the partitioned cache includes cache partitions. The cache partitions are located within the storage nodes, wherein each cache partition is smaller than the data object. As described above, the storage nodes could have a write and read cache or a read cache only. Moreover, the caching within the partitioned cache only caches the data blocks in parity storage nodes, wherein the parity storage nodes include a parity storage field (item 942). Thus, caching the data blocks within storage nodes lacking the parity storage field is avoided (item 944). Accordingly, as described above, a separate cache directory is not required because the cached data blocks are only in the parity storage nodes.

When caching the data blocks within the partitioned cache, and when the storage nodes have more than one parity storage node, the data blocks are cached in any of the parity storage nodes (item 946). As described above, FIGS. 4A, 4B, 4C, and 4D illustrate four alternatives to describe how the parity node(s) gather data pages from client or target nodes. In FIG. 4A, the target node (in response to a client read) ships the data to one or more parity nodes. In FIG. 4B, the client demotes a clean page it would have discarded to one or more parity nodes. Further, in FIG. 4C, the target demotes the page to the parity node. In FIG. 4D, the parity node asynchronously reads pages from the target node.

The method 900 also includes, in item 950, updating the data object. This includes annotating a write request with information regarding changed data blocks within the data object (item 952) and sending the write request only to the parity storage nodes (item 954). The sending of the write request only to the parity storage nodes comprises simultaneously performing an invalidation operation and a write operation (item 956). Thus, as described above, cache invalidation is piggybacked onto regular operations. Due to the nature of erasure code updating, cache coherence is free because the parity node(s) have to be written to for a write completion. The annotation helps identify which blocks have changed. Subsequently, in item 960, the data blocks and parity data block are read from the storage nodes. The method 900 can check the cache at the parity storage nodes before reading the data block from the target storage nodes.

Accordingly, the embodiments of the invention provide a technique to build a lightweight cache coherence protocol in distributed storage systems by exploiting the update patterns inherent with erasure coded data. Such a technique can unify the partitioned caches into a single large read cache and can use the cached data to improve RAID I/O operations from clients.

The foregoing description of the specific embodiments will so fully reveal the general nature of the invention that others can, by applying current knowledge, readily modify and/or adapt for various applications such specific embodiments without departing from the generic concept, and, therefore, such adaptations and modifications should and are intended to be comprehended within the meaning and range of equivalents of the disclosed embodiments. It is to be understood that the phraseology or terminology employed herein is for the purpose of description and not of limitation. Therefore, while the embodiments of the invention have been described in terms of preferred embodiments, those skilled in the art will recognize that the embodiments of the invention can be practiced with modification within the spirit and scope of the appended claims.

1-7. (canceled)
7. A method for cache management within a distributed data storage system, said method comprising: partitioning a data object into a plurality of data blocks; creating at least one parity data block from said data object; storing said data blocks and said parity data block within storage nodes; caching said data blocks within a partitioned cache, wherein said partitioned cache comprises a plurality of cache partitions, wherein said cache partitions are located within said storage nodes, wherein said caching within said partitioned cache only caches said data blocks in parity storage nodes, wherein said parity storage nodes comprise a parity storage field; updating said data object, said updating comprising annotating a write request with information regarding changed data blocks within said data object, and sending said write request only to said parity storage nodes; and reading said data blocks and said parity data block from said storage nodes; wherein said caching within said partitioned cache comprises avoiding caching said data blocks within storage nodes lacking said parity storage field, wherein said sending of said write request only to said parity storage nodes comprises simultaneously performing an invalidation operation and a write operation, and wherein said caching of said data blocks within said partitioned cache comprises, when said storage nodes comprise more than one of said parity storage nodes, caching said data blocks in any of said parity storage nodes.

8-20. (canceled)